@ErwinTerpstra commented Nov 26, 2025

Proposed changes

Add support for grouped gemm Tile Loop instances on RDNA4. This PR contains:

  • Device struct for grouped gemm Multiple D using the Tile Loop algorithm (DeviceGroupedGemmMultipleD_Wmma_CShuffle_TileLoop_V3)
    • Common code with the XDL implementation has been refactored out to be shared between the two variants
  • Instance variants matching the XDL instances
  • Profiler implementation profile_grouped_gemm_tile_loop_generic_impl which supports all variants.
    • This consolidates the previously separate profile_grouped_gemm_tile_loop_impl and profile_grouped_gemm_tile_loop_multiply_generic_impl. The interfaces are kept for compatibility and redirect to the new generic implementation.
  • Tests for all tile loop vanilla and Multiple D tile loop instances.
    • The same test fixture is used for all variants.
    • Note that XDL instances were previously not covered by tests, but are included in these new tests.
  • Other changes:
    • Fixed an issue with the MultiplyAdd operator where the calculation was performed in the data type of the E tensor instead of the accumulator data type, causing precision issues.
    • Added a tuple_element_or_t helper to the general tuple header. It retrieves the element type at a given index of the tuple, or falls back to a specified default type if the index is out of range.
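
For readers unfamiliar with this kind of helper, the following is a minimal sketch of how a `tuple_element_or_t` trait can behave. It is written against `std::tuple` for illustration only; the helper added in this PR lives in the library's own tuple header, and its actual template parameter order and implementation may differ. The `DsTypes` alias is hypothetical.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>

// Illustrative stand-in (NOT the code from this PR): selects the I-th element
// type of Tuple, or Default when I is out of range.
template <std::size_t I,
          typename Tuple,
          typename Default,
          bool InRange = (I < std::tuple_size_v<Tuple>)>
struct tuple_element_or
{
    // Out-of-range case: fall back to the default type.
    using type = Default;
};

// In-range case: only this specialization touches std::tuple_element_t,
// so an out-of-range index never triggers a hard error.
template <std::size_t I, typename Tuple, typename Default>
struct tuple_element_or<I, Tuple, Default, true>
{
    using type = std::tuple_element_t<I, Tuple>;
};

template <std::size_t I, typename Tuple, typename Default>
using tuple_element_or_t = typename tuple_element_or<I, Tuple, Default>::type;

// Hypothetical list of D tensor data types with two entries.
using DsTypes = std::tuple<float, int>;

static_assert(std::is_same_v<tuple_element_or_t<0, DsTypes, void>, float>);
static_assert(std::is_same_v<tuple_element_or_t<2, DsTypes, void>, void>);
```

A trait like this lets generic code (for example a profiler or test fixture shared by the vanilla and Multiple D variants) ask for "the type of D0, if there is one" without special-casing an empty list of D tensors.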

Note that this PR depends on #3303.
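
To illustrate the MultiplyAdd precision fix listed under "Other changes", here is a small sketch of why the compute type matters for `e = c * d0 + d1`. It is not the operator from this PR: the function names are hypothetical, and `float`/`double` stand in for a narrow E type and a wider accumulator type so that the example needs no half-precision support.

```cpp
// Illustrative only (NOT the library's MultiplyAdd operator).
// E   : output tensor data type (narrower)
// Acc : accumulator data type (wider)

// Behaviour described as the bug: the accumulator value is narrowed to E
// first, then the multiply-add runs in E.
template <typename E, typename Acc, typename D0, typename D1>
constexpr E multiply_add_in_e_type(Acc c, D0 d0, D1 d1)
{
    return static_cast<E>(c) * static_cast<E>(d0) + static_cast<E>(d1);
}

// Behaviour described as the fix: the multiply-add runs in the accumulator
// type and only the final result is narrowed to E.
template <typename E, typename Acc, typename D0, typename D1>
constexpr E multiply_add_in_acc_type(Acc c, D0 d0, D1 d1)
{
    return static_cast<E>(c * static_cast<Acc>(d0) + static_cast<Acc>(d1));
}

// 16777217 = 2^24 + 1 is exactly representable in double but not in float,
// so narrowing the accumulator early loses the low bit before the add.
static_assert(multiply_add_in_e_type<float>(16777217.0, 1.0f, -16777216.0f) == 0.0f);
static_assert(multiply_add_in_acc_type<float>(16777217.0, 1.0f, -16777216.0f) == 1.0f);
```

The same effect is more pronounced with 16-bit E types, where the early conversion discards far more of the accumulator's precision.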

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
    • N/A
  • I have added inline documentation which helps the maintainers understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
    • N/A
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

@ErwinTerpstra force-pushed the eterpstr/175-implement-device_grouped_gemm_tile_loop-for-rdna4 branch from b66874d to 44cc94d on January 8, 2026 09:51
@ErwinTerpstra merged commit eb04107 into ROCm:develop on Jan 13, 2026
21 checks passed